1. Foreword and notations

The purpose of this work is to construct an explicit connection between discrete
Stein characterizations and discrete information functionals (see [10], where
similar considerations are discussed for continuous distributions). In doing so we
also provide two general Stein characterizations of discrete distributions, as well
as a family of identities relating differences between expectations to what we
call generalized score functions. In the context of Poisson approximation, our
results allow us, in particular, to construct bounds relating the total variation
distance to (i) the so-called scaled Fisher information used, e.g., in [8], as well
as (ii) the discrete Fisher information used, e.g., in [7]. We refer the reader to
[6, 11] and [1] for relevant references and similar inequalities.

Throughout the paper we shall, by abuse of language, call discrete probability
mass functions densities. Also, to avoid ambiguities related to division by 0,
we adopt the convention that, whenever an expression involves division by
an indicator function $\mathbb{I}_A$ for some measurable set $A$, we are multiplying
the expression by the said indicator function. In particular, note that ratios of the
form $p(x)/p(x)$, with $p(x)$ some function, do not necessarily simplify to 1. Finally,
we adopt the convention that sums running over empty sets equal 0.

1 Supported by a Mandat de Chargé de Recherche from the Fonds National de la Recherche
Scientifique, Communauté française de Belgique. Christophe Ley is also a member of ECARES.

2 Supported by a Mandat de Chargé de Recherche from the Fonds National de la Recherche
Scientifique, Communauté française de Belgique.



C. Ley and Y. Swan. /Stein characterizations and information distances 






2. First connection 

We start with a discrete version of the so-called density approach (see [2, 10, 13] 
for a description in the continuous case). 

Theorem 2.1 (Discrete density approach). Let $p$ be a density with support
$S_p \subset \mathbb{Z}$. For the sake of convenience, we choose
$S_p = [a, b] := \{a, a+1, \ldots, b\}$ with $a < b \in \mathbb{Z} \cup \{\pm\infty\}$.
Let $\mathcal{F}_1(p)$ be the collection of all test functions $f : \mathbb{Z} \to \mathbb{R}$
such that $x \mapsto f(x)p(x)$ is bounded on $S_p$ and $f(a) = 0$. Let
$\Delta_x^+ h(x) := h(x+1) - h(x)$ be the forward difference operator and define
$\mathcal{T}_1(\cdot, p) : \mathbb{Z}^\star \to \mathbb{R}^\star$ through

\[ \mathcal{T}_1(f,p) : \mathbb{Z} \to \mathbb{R} : x \mapsto \mathcal{T}_1(f,p)(x) := \frac{\Delta_x^+\big(f(x)p(x)\big)}{p(x)}. \tag{2.1} \]

Let $Z \sim p$ and let $X$ be a real-valued discrete random variable.

(1) If $X \stackrel{\mathcal{L}}{=} Z$ then $E[\mathcal{T}_1(f,p)(X)] = 0$ for all $f \in \mathcal{F}_1(p)$.

(2) If $E[\mathcal{T}_1(f,p)(X)] = 0$ for all $f \in \mathcal{F}_1(p)$, then $X \mid X \in S_p \stackrel{\mathcal{L}}{=} Z$.

We draw the reader's attention to the similarity between the operator $\mathcal{T}_1$ and
the operators introduced in [9, 10]: in the terminology of [9], our operator (2.1)
allows for a discrete "location"-based parametric interpretation.

Proof. The first statement is trivial. To see (2), consider for $z \in \mathbb{Z}$ the functions
$f_z^p$ defined through

\[ f_z^p : \mathbb{Z} \to \mathbb{R} : x \mapsto \frac{1}{p(x)} \sum_{k=a}^{x-1} l_z(k)p(k), \]

with $l_z(k) := \big(\mathbb{I}_{(-\infty,z]}(k) - P_p(X \le z)\big)\mathbb{I}_{S_p}(k)$ and
$P_p(X \le z) := \sum_{k=-\infty}^{z} p(k)$.
It is evident that $x \mapsto f_z^p(x)p(x)$ is bounded and that $f_z^p(a) = 0$ by our
convention on sums, hence $f_z^p \in \mathcal{F}_1(p)$ for all $z$. Moreover we have
$\Delta_x^+\big(f_z^p(x)p(x)\big) = l_z(x)p(x)$. This result is direct for $x < b$; for $x = b$,

\[ \Delta_x^+\big(f_z^p(x)p(x)\big)\big|_{x=b} = f_z^p(b+1)p(b+1) - f_z^p(b)p(b) = 0 - \sum_{k=a}^{b-1} l_z(k)p(k) = l_z(b)p(b), \]

since $\sum_{k=a}^{b} l_z(k)p(k) = 0$ by definition of $l_z$. It follows that $f_z^p$ satisfies,
for all $z$, the so-called Stein equation

\[ \mathcal{T}_1(f_z^p, p)(x) = l_z(x). \]

Consequently, we can use $E[\mathcal{T}_1(f_z^p, p)(X)] = 0$ to obtain

\[ P(X \le z \cap X \in S_p) = P(X \in S_p)\,P_p(X \le z) \]

for all $z \in \mathbb{Z}$. In other words, provided that $P(X \in S_p) > 0$,
$P(X \le z \mid X \in S_p) = P_p(X \le z) = P(Z \le z)$ for all $z \in \mathbb{Z}$, whence the claim. □

Note that the choice of a "connected" support is for convenience only, and
straightforward arguments allow one to adapt the result to supports of the form
$[a, b] \cup [c, d]$ with $c > b$. Likewise, the use of a forward difference in the expression
of the operator is purely arbitrary, and minor adaptations (e.g., setting $f(b) = 0$
instead of $f(a) = 0$) allow one to reformulate (2.1) in terms of backward differences
as well.
Example 2.1. It is perhaps informative to see how the operator $\mathcal{T}_1(f, p)$ spells
out in certain specific examples.

1. Take $p(x) = e^{-\lambda}\lambda^x/x!\,\mathbb{I}_{\mathbb{N}}(x)$, the density of a mean-$\lambda$ Poisson random
variable. Then (abusing notations) $\mathcal{F}_1(Po(\lambda))$ contains the set of bounded
functions $f$ with $f(0) = 0$, and simple computations show that the operator
becomes

\[ \mathcal{T}_1(f, Po(\lambda))(x) = \Big(\frac{\lambda}{x+1} f(x+1) - f(x)\Big)\mathbb{I}_{\mathbb{N}}(x). \]

2. Take $p$ to be a member of Ord's family, i.e. suppose that there exist $s(x)$ and
$\tau(x)$ such that

\[ \frac{p(x+1)}{p(x)} = \frac{s(x) + \tau(x)}{s(x+1)}. \]

For an explanation of these notations see [12]. The collection $\mathcal{F}_1((s,\tau))$
contains the set of all functions of the form $f(x) = f_0(x)s(x)$ with $f_0$ bounded
and, for these $f$, the operator reads

\[ \mathcal{T}_1(f,(s,\tau))(x) = (s(x) + \tau(x)) f_0(x+1)\mathbb{I}_{[a,b]}(x+1) - f_0(x)s(x)\mathbb{I}_{[a,b]}(x). \]

We retrieve, up to some minor modifications, the operator presented in [12];
using the backward difference operator and functions $f$ of the form
$f_0(x)(s(x) + \tau(x))$ yields exactly the operator proposed in that paper.

3. Write $p$ as a Gibbs measure, i.e. $p(x) = e^{V(x)}\omega^x/(x!\,Z)\,\mathbb{I}_{[0,N]}(x)$ with $N$ some
positive integer, $\omega > 0$ fixed, $V$ a function mapping $\mathbb{N}$ to $\mathbb{R}$ and $Z$ the
normalizing constant. For an explanation of the notations see [4]. The collection
$\mathcal{F}_1((V,\omega))$ contains the set of all functions of the form $f(x) = x f_0(x)$ with
$f_0$ bounded and, for these $f$, the operator is of the form

\[ \mathcal{T}_1(f,(V,\omega))(x) = f_0(x+1)e^{V(x+1)-V(x)}\omega\,\mathbb{I}_{[0,N]}(x+1) - x f_0(x)\mathbb{I}_{[0,N]}(x). \]

Supposing, as in [4], that $f_0(N+1) = 0$, the latter operator simplifies to

\[ \mathcal{T}_1(f,(V,\omega))(x) = \big(f_0(x+1)e^{V(x+1)-V(x)}\omega - x f_0(x)\big)\mathbb{I}_{[0,N]}(x), \]

which corresponds to the Stein operator presented in [4].
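
As a quick numerical sanity check of the Poisson case in Example 2.1, the following minimal Python sketch evaluates the expectation of the Poisson operator by truncated summation. The test function f(x) = x/(1+x), the value lambda = 3, the geometric comparison law and the truncation level are illustrative assumptions and do not come from the paper.

```python
import math

def poisson_pmf(x, lam):
    """Mass function of Po(lam) at the integer x >= 0."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def geometric_pmf(x, r):
    """Mass function (1 - r)^x r of a geometric law on {0, 1, 2, ...}."""
    return (1.0 - r) ** x * r

def stein_operator(f, lam, x):
    """Poisson operator of Example 2.1: lam / (x + 1) * f(x + 1) - f(x), x >= 0."""
    return lam / (x + 1) * f(x + 1) - f(x)

def expectation(g, pmf, upper=150):
    """Truncated expectation sum_{x=0}^{upper} g(x) pmf(x) (truncation is an assumption)."""
    return sum(g(x) * pmf(x) for x in range(upper + 1))

lam = 3.0
f = lambda x: x / (1.0 + x)  # bounded test function with f(0) = 0 (illustrative choice)

under_poisson = expectation(lambda x: stein_operator(f, lam, x),
                            lambda x: poisson_pmf(x, lam))
under_geometric = expectation(lambda x: stein_operator(f, lam, x),
                              lambda x: geometric_pmf(x, 0.25))
print(under_poisson, under_geometric)
```

Under the Poisson law the sum telescopes and vanishes up to rounding, in line with statement (1) of Theorem 2.1; under the geometric law, which has the same mean but is not Poisson, it stays away from zero.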

Following the methodology introduced in [10], the next step consists in
uncovering a factorization property of the operator (2.1) for two densities $p$ and $q$.
It will be fruitful to consider distributions $p$ and $q$ having non-equal supports.
We choose to fix, for the sake of convenience (and for this sake only), $S_p = \mathbb{N}$
and $S_q = [0, \ldots, N]$, for $N \in \mathbb{N} \cup \{\infty\}$. Then, since we can always write
$1 = q(x)/q(x) + \mathbb{I}_{[N+1,\ldots]}(x)$, we get

\[ \mathcal{T}_1(f,p)(x) = \frac{\Delta_x^+\big(f(x)p(x)\big)}{p(x)}\,\frac{q(x)}{q(x)} + \frac{\Delta_x^+\big(f(x)p(x)\big)}{p(x)}\,\mathbb{I}_{[N+1,\ldots]}(x). \tag{2.2} \]

Now recall the product rule for discrete derivatives

\[ \Delta_x^+\big(h(x)g(x)\big) = h(x+1)\Delta_x^+ g(x) + g(x)\Delta_x^+ h(x). \]
Applying this and keeping in mind that we have set $S_q \subset S_p$, the first term on
the rhs of (2.2) becomes

\[ \frac{f(x+1)q(x+1)}{p(x)}\,\Delta_x^+\Big(\frac{p(x)}{q(x)}\Big) + \frac{\Delta_x^+\big(f(x)q(x)\big)}{p(x)}\,\frac{p(x)}{q(x)}
 = f(x+1)\Big(\frac{p(x+1)}{p(x)} - \frac{q(x+1)}{q(x)}\Big)\mathbb{I}_{[0,\ldots,N]}(x) + \frac{\Delta_x^+\big(f(x)q(x)\big)}{q(x)}. \]

Therefore, letting

\[ r_1(p,q)(x) := \Big(\frac{p(x+1)}{p(x)} - \frac{q(x+1)}{q(x)}\Big)\mathbb{I}_{[0,\ldots,N]}(x), \tag{2.3} \]

we have just shown that, for all $f \in \mathcal{F}_1(p) \cap \mathcal{F}_1(q)$, we have the factorization
property

\[ \mathcal{T}_1(f,p)(x) = \mathcal{T}_1(f,q)(x) + f(x+1)\,r_1(p,q)(x) + e_f^p(x), \tag{2.4} \]

where

\[ e_f^p(x) := f(x+1)\frac{p(x+1)}{p(x)}\,\mathbb{I}_{[N,\ldots]}(x) - f(x)\,\mathbb{I}_{[N+1,\ldots]}(x). \]

Remark 2.1. The statements above (and their consequences) are easily adapted
to situations where $S_p \subset S_q$; having in mind the context of a Poisson target $p$
explains our willingness to restrict our choice.

Now let $l : \mathbb{Z} \to \mathbb{R}$ be a function such that $E_p[l(X)]$ and $E_q[l(X)]$ exist, with
$E_r[l(X)] := \sum_{k \in S_r} l(k)r(k)$ for a density $r$ with support $S_r$. Still following [10],
it is immediate that the function

\[ f_l^p : \mathbb{Z} \to \mathbb{R} : x \mapsto \frac{1}{p(x)} \sum_{k=0}^{x-1} \big(l(k) - E_p[l(X)]\big)p(k) \tag{2.5} \]

is a solution of the so-called Stein equation $\mathcal{T}_1(f,p)(x) = l(x) - E_p[l(X)]$, so that,
taking expectations and using (2.4), we get

\[ E_q[l(X)] - E_p[l(X)] = E_q\big[f_l^p(X+1)\,r_1(p,q)(X)\big] + e^{p,q}(l) \tag{2.6} \]

with $e^{p,q}(l) := q(N) f_l^p(N+1)\,p(N+1)/p(N)$.

Remark 2.2. The error term $e^{p,q}(l)$ in (2.6) will be negligible as $N$ tends to
infinity since, in general, the Stein solution will be bounded over $\mathbb{N}$. This
latter fact also ensures that $f_l^p$ belongs to $\mathcal{F}_1(q)$.
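
The identity (2.6) can be checked numerically in the case N = infinity, where the error term vanishes. The sketch below is only an illustration under stated assumptions: the target is Po(3), q is a geometric law with full support, l is the indicator of {0, 1, 2}, and all sums are truncated. The helper `stein_solution` (a hypothetical name) evaluates (2.5) through a head sum for small x and through the equivalent tail sum for large x, a purely numerical device to avoid cancellation.

```python
import math

def poisson_pmf(x, lam):
    """Mass function of Po(lam) at the integer x >= 0."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def stein_solution(l, lam, M):
    """Values f(0), ..., f(M) of (2.5) for the target Po(lam).

    f(x) = (1 / p(x)) * sum_{k < x} (l(k) - E_p[l]) p(k).  The centred terms
    sum to zero, so the head sum equals minus the tail sum; the tail form is
    used for x > lam to limit cancellation errors.
    """
    p = [poisson_pmf(k, lam) for k in range(M + 1)]
    mean_l = sum(l(k) * p[k] for k in range(M + 1))
    centred = [(l(k) - mean_l) * p[k] for k in range(M + 1)]
    head = [0.0] * (M + 2)              # head[x] = sum_{k < x} centred[k]
    for k in range(M + 1):
        head[k + 1] = head[k] + centred[k]
    tail = [0.0] * (M + 2)              # tail[x] = sum_{k >= x} centred[k]
    for k in range(M, -1, -1):
        tail[k] = tail[k + 1] + centred[k]
    return [head[x] / p[x] if x <= lam else -tail[x] / p[x] for x in range(M + 1)]

def check_identity(lam=3.0, r=0.25, K=100, M=130):
    """Return both sides of (2.6) for q geometric(r), truncating sums at K."""
    l = lambda x: 1.0 if x <= 2 else 0.0
    p = lambda x: poisson_pmf(x, lam)
    q = lambda x: (1.0 - r) ** x * r
    f = stein_solution(l, lam, M)
    lhs = sum(l(x) * (q(x) - p(x)) for x in range(K + 1))
    score = lambda x: p(x + 1) / p(x) - q(x + 1) / q(x)   # r_1(p, q)(x) of (2.3)
    rhs = sum(f[x + 1] * score(x) * q(x) for x in range(K + 1))
    return lhs, rhs

lhs, rhs = check_identity()
print(lhs, rhs)
```

Both numbers agree up to truncation and rounding errors, which illustrates how the difference of expectations is carried entirely by the score function when the two supports coincide.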

We will apply (2.6) in the context of a Poisson target distribution in Section 4. 
In particular we will show how our approach provides a connection between the 
so-called total variation distance (as well as many other probability distances) 
and the scaled Fisher information in use for information theoretic approaches 
to Poisson approximation problems (see [1, 8, 11]). 
3. A second connection

The construction from the previous section (i.e. the factorization (2.4), the score
function (2.3) and the identity (2.6)) is by no means unique, nor is the initial
characterization from Theorem 2.1. There are, in fact, infinitely many
variations on the different steps outlined above, each providing a connection
between probability distances and different forms of information distances. Now it
appears that, in the world of Poisson approximation, the scaled Fisher information
is not the only "natural" measure of discrepancy, and [7] (followed later by
[1]) makes use of another information distance, which they call the discrete Fisher
information. We choose to show how this specific distance can be obtained from
our Stein characterizations as well.

In [9] we propose a construction of Stein characterizations tailored for
parametric densities, that is, densities depending on some real-valued parameter. In
what follows, we shall denote by $p_\theta(x)$ the parametric density with parameter
$\theta$ belonging to the parameter space $\Theta$. For the sake of simplicity we consider
families with support $S_p = [a, b] := \{a, a+1, \ldots, b\}$ with $a < b \in \mathbb{Z} \cup \{\pm\infty\}$ not
depending on $\theta$; we also suppose that, for all $x$, the function $\theta \mapsto p_\theta(x)$ is
continuously differentiable. (A similar result can also be obtained for integer-valued
parameters $\theta$.) We then obtain the following result, whose proof is omitted
because it is directly inspired by [9] and runs along the same lines as the proof
of Theorem 2.1.

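Theorem 3.1 (Parametric discrete density approach). For $\theta$ an interior point
of $\Theta$, let $p(x) := p_\theta(x)$ be a parametric density with support $S_p \subset \mathbb{Z}$ and define
$\tilde{p}(x) := \partial_\theta\big(p_\theta(x)/p_\theta(a)\big)$. Let $\mathcal{F}_2(p)$ be the collection of all test functions
$f : \mathbb{Z} \to \mathbb{R}$ such that $x \mapsto f(x)\tilde{p}(x)$ is bounded on $S_p$. Define the operator
$\mathcal{T}_2(\cdot, p) : \mathbb{Z}^\star \to \mathbb{R}^\star$ through

\[ \mathcal{T}_2(f,p) : \mathbb{Z} \to \mathbb{R} : x \mapsto \mathcal{T}_2(f,p)(x) := \frac{\Delta_x^+\big(f(x)\tilde{p}(x)\big)}{p(x)}. \]

Let $Z \sim p$ and let $X$ be a real-valued discrete random variable.

(1) If $X \stackrel{\mathcal{L}}{=} Z$ then $E[\mathcal{T}_2(f,p)(X)] = 0$ for all $f \in \mathcal{F}_2(p)$.

(2) If $E[\mathcal{T}_2(f,p)(X)] = 0$ for all $f \in \mathcal{F}_2(p)$, then $X \mid X \in S_p \stackrel{\mathcal{L}}{=} Z$.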

We draw the reader's attention to the fact that, contrary to $\mathcal{F}_1(p)$ in
Theorem 2.1, the class of test functions $\mathcal{F}_2(p)$ here does not require that $f(a) = 0$.
This comes from the fact that, by definition, $\tilde{p}(a) = 0$, hence this requirement
on $f$ can be dropped.

Theorem 3.1 allows one to recover the well-known Stein operators and
characterizations of the Poisson, geometric and binomial distributions, to cite but these;
we refer the reader to [9] for intuition about the perhaps unusual form of the
operator, as well as for explicit computations and examples.

From here onwards we restrict our attention to distributions $p$ and $q$ with
full support $\mathbb{N}$. Note that this entails that $\tilde{p}(x)$ and $q(x-1)$ share the same
support $\mathbb{N}_0$. While not strictly necessary, this assumption will yield considerable
simplifications. It is, moreover, in line with the related literature when a Poisson
target is to be considered (see [1]).

Proceeding as in Section 2 (and keeping all supports implicit) we readily
obtain



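\[ \mathcal{T}_2(f,p)(x) = \frac{\Delta_x^+\Big(f(x)q(x-1)\,\frac{\tilde{p}(x)}{q(x-1)}\Big)}{p(x)}
 = \frac{\Delta_x^+\big(f(x)q(x-1)\big)}{q(x)}\,\frac{\tilde{p}(x+1)}{p(x)} + \frac{f(x)q(x-1)}{p(x)}\,\Delta_x^+\Big(\frac{\tilde{p}(x)}{q(x-1)}\Big). \]

Straightforward simplifications then yield for $f \in \mathcal{F}_2(p) \cap \mathcal{F}_2(q)$ the factorization

\[ \mathcal{T}_2(f,p)(x) = f(x)\,r_2(p,q)(x) + \frac{\Delta_x^+\big(f(x)q(x-1)\big)}{q(x)}\,\frac{\tilde{p}(x+1)}{p(x)} \tag{3.1} \]

with

\[ r_2(p,q)(x) := \frac{q(x-1)}{q(x)}\,\frac{\tilde{p}(x+1)}{p(x)} - \frac{\tilde{p}(x)}{p(x)}. \tag{3.2} \]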
Now let $l : \mathbb{Z} \to \mathbb{R}$ be a function such that $E_p[l(X)]$ and $E_q[l(X)]$ exist, and
define

\[ f_l^p : \mathbb{Z} \to \mathbb{R} : x \mapsto \frac{1}{\tilde{p}(x)} \sum_{k=0}^{x-1} \big(l(k) - E_p[l(X)]\big)p(k). \tag{3.3} \]

Then clearly $\mathcal{T}_2(f_l^p, p)(x) = l(x) - E_p[l(X)]$ so that, taking expectations on
both sides of (3.1) for this choice of test function, we obtain

\[ E_q[l(X)] - E_p[l(X)] = E_q\big[f_l^p(X)\,r_2(p,q)(X)\big] + E_q\Big[\frac{\Delta_x^+\big(f_l^p(x)q(x-1)\big)\big|_{x=X}}{q(X)}\,\frac{\tilde{p}(X+1)}{p(X)}\Big]. \]

Finally suppose that $\tilde{p}(X+1)/p(X)$ simplifies to a constant (as is the case for
a Poisson target). Then straightforward calculations lead to the analog of (2.6)
for the score function $r_2$, namely

\[ E_q[l(X)] - E_p[l(X)] = E_q\big[f_l^p(X)\,r_2(p,q)(X)\big]. \tag{3.4} \]
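
The "straightforward calculations" behind (3.4) amount to a telescoping sum. As a sketch, write $\tilde{p}$ for the derivative appearing in Theorem 3.1, $f$ for the solution (3.3) and $c$ for the constant value of $\tilde{p}(X+1)/p(X)$, and assume (for illustration only) that $f(x+1)q(x)$ vanishes as $x \to \infty$. The remainder term produced by the factorization (3.1) then reads:

```latex
\begin{align*}
E_q\!\left[\frac{\Delta_x^+\big(f(x)q(x-1)\big)\big|_{x=X}}{q(X)}\; c\right]
  &= c \sum_{x=0}^{\infty} \big(f(x+1)q(x) - f(x)q(x-1)\big) \\
  &= c \Big(\lim_{x\to\infty} f(x+1)q(x) - f(0)q(-1)\Big) = 0,
\end{align*}
```

where the last equality uses $q(-1) = 0$, since $q$ is supported on $\mathbb{N}$.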

As will be shown in Section 4, specifying a Poisson distribution for the target p 
in (3.4) yields the scaled score function whose variance is the so-called discrete 
Fisher information introduced in [7]. 

4. Applications to a Poisson target 

Working as in [10] it is easy to obtain, from (2.6) and (3.4), inequalities of the
form

\[ d_{\mathcal{H}}(p,q) := \sup_{h \in \mathcal{H}} \big|E_q[h(X)] - E_p[h(X)]\big| \le \kappa(p,q)\,\mathcal{J}(p,q), \]

where $\mathcal{H}$ is, as usual, a suitably chosen class of functions, $\kappa(p,q)$ are constants
depending on both $p$ and $q$, and $\mathcal{J}(p,q)$ is a so-called information distance
between $p$ and $q$, which is given by the variance of one of the score functions (2.3)
or (3.2) introduced in the two previous sections. The main difficulty then resides
in computing the constants appearing in these inequalities and in putting the
information distance $\mathcal{J}$ to good use. Such computations are not the primary
purpose of the present paper. Hence we choose to focus on a Poisson target, for
which much is already known. From here onwards we therefore only consider
$p = Po(\lambda)$, the mean-$\lambda$ Poisson density.

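We first adapt the results from Section 2. The score function (2.3) becomes

\[ r_1(Po(\lambda),q)(x) = \frac{\lambda}{x+1}\Big(1 - \frac{(x+1)q(x+1)}{\lambda q(x)}\Big)\mathbb{I}_{[0,\ldots,N]}(x) \]

so that (2.6) yields

\[ E_q[l(X)] - E_p[l(X)] = E_q\Big[\frac{\lambda}{X+1}\,f_l^p(X+1)\Big(1 - \frac{(X+1)q(X+1)}{\lambda q(X)}\Big)\Big] + e^{p,q}(l). \tag{4.1} \]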

One recognizes, in the rhs of (4.1), the scaled score function whose variance
yields the scaled Fisher information

\[ \mathcal{K}_1(Po(\lambda),q) := \lambda\,E_q\Big[\Big(\frac{(X+1)q(X+1)}{\lambda q(X)} - 1\Big)^2\Big]. \]

This information distance is subadditive over convolutions; this is useful when
computing rates of convergence for sums towards the Poisson distribution (see,
e.g., [1, 8]). Using a Poincaré inequality, [8] show that, for $q$ a discrete
distribution with mean $\lambda$,

\[ \|q - Po(\lambda)\|_{TV} \le \sqrt{2\,\mathcal{K}_1(Po(\lambda),q)}, \]

with $\|\cdot\|_{TV}$ indicating the total variation distance. From (4.1) and Hölder's
inequality we obviously recover a much more general result, namely

\[ d_{\mathcal{H}}(Po(\lambda),q) = \sup_{h \in \mathcal{H}} \big|E_q[h(X)] - E_{Po(\lambda)}[h(X)]\big| \le H_{1,\mathcal{H}}(Po(\lambda),q)\,\sqrt{\mathcal{K}_1(Po(\lambda),q)}, \tag{4.2} \]



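where the constant

\[ H_{1,\mathcal{H}}(Po(\lambda),q) := \sup_{h \in \mathcal{H}} \left( \sqrt{E_q\Big[\Big(\frac{\sqrt{\lambda}\,f_h^p(X+1)}{X+1}\Big)^2\Big]} + \frac{|e^{p,q}(h)|}{\sqrt{\mathcal{K}_1(Po(\lambda),q)}} \right) \]

is some kind of general Stein (magic) factor. The notation $H$ for these constants
is borrowed from [1], where similar relationships are obtained within the context
of compound Poisson approximation.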
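Likewise, in the notations of Section 3, we have $\tilde{p}(x) = \lambda^{x-1}/(x-1)!\,\mathbb{I}_{\mathbb{N}_0}(x)$
so that $\tilde{p}(x+1)/p(x) = e^{\lambda}\,\mathbb{I}_{\mathbb{N}}(x)$ and $\tilde{p}(x)/p(x) = e^{\lambda}x/\lambda\,\mathbb{I}_{\mathbb{N}}(x)$. Hence for all $q$
with full support $\mathbb{N}$ we get

\[ r_2(Po(\lambda),q)(x) = e^{\lambda}\Big(\frac{q(x-1)}{q(x)} - \frac{x}{\lambda}\Big)\mathbb{I}_{\mathbb{N}}(x) \]

so that (3.4) yields

\[ E_q[l(X)] - E_p[l(X)] = E_q\Big[e^{\lambda} f_l^p(X)\Big(\frac{q(X-1)}{q(X)} - \frac{X}{\lambda}\Big)\Big]. \tag{4.3} \]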



One recognizes, in the rhs of (4.3), a special instance of the Katti-Panjer score
function introduced in [1, equation (3.1)] and whose variance yields our second
information distance, namely the discrete Fisher information

\[ \mathcal{K}_2(Po(\lambda),q) := E_q\Big[\Big(\frac{\lambda\,q(X-1)}{q(X)} - X\Big)^2\Big]. \tag{4.4} \]



This is easily shown to be related to the discrete Fisher information distance
$I(q) := E_q\big[(q(X-1)/q(X) - 1)^2\big]$ introduced in [7]. The information distance
(4.4) has been shown to be subadditive over convolutions (see [1]). From (4.3)
and Hölder's inequality we obviously recover the following general relationship

\[ d_{\mathcal{H}}(Po(\lambda),q) = \sup_{h \in \mathcal{H}} \big|E_q[h(X)] - E_{Po(\lambda)}[h(X)]\big| \le H_{2,\mathcal{H}}(Po(\lambda),q)\,\sqrt{\mathcal{K}_2(Po(\lambda),q)}, \tag{4.5} \]



where the constant

\[ H_{2,\mathcal{H}}(Po(\lambda),q) := \sqrt{\sup_{h \in \mathcal{H}} E_q\Big[\big(\lambda^{-1}e^{\lambda} f_h^p(X)\big)^2\Big]} \]

is, again, some kind of general Stein (magic) factor.
We conclude the paper with explicit computations.

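Proposition 4.1. Take $p = Po(\lambda)$ and $q$ a pdf with support $[0, \ldots, N]$. Then

\[ \|p - q\|_{TV} \le \sqrt{\lambda}\,H(\lambda)\sqrt{\mathcal{K}_1(Po(\lambda),q)} + e_q^N \quad\text{and}\quad \|p - q\|_{TV} \le H(\lambda)\sqrt{\mathcal{K}_2(Po(\lambda),q)}, \tag{4.6} \]

where the error term $e_q^N$ is of order $q(N)/(N+1)$ and $H(\lambda) = 1 \wedge \sqrt{2/(e\lambda)}$. The
second bound in (4.6) only holds if $N = \infty$.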
Proof. Choose

\[ h(x) := \mathbb{I}_{[p(x) \le q(x)]} - \mathbb{I}_{[p(x) > q(x)]} = 2\,\mathbb{I}_{[p(x) \le q(x)]} - 1. \]

Then obviously $E_p[h(X)]$ and $E_q[h(X)]$ exist, and

\[ \sum_x |p(x) - q(x)| = E_q[h(X)] - E_p[h(X)] \]

so that, by definition of the total variation distance, we get

\[ \|p - q\|_{TV} = \frac{1}{2}\big(E_q[h(X)] - E_p[h(X)]\big). \]

It now suffices to apply (4.2) and (4.5), respectively, to obtain the announced
relationships. All that remains is to compute bounds on the constants.
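
The representation of the total variation distance through the sign function h is easy to illustrate numerically. The following Python sketch is a toy check under assumed inputs (a Po(3) target, a geometric law q, sums truncated at 150); none of these choices come from the paper.

```python
import math

def poisson_pmf(x, lam):
    """Mass function of Po(lam) at the integer x >= 0."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def geometric_pmf(x, r):
    """Mass function (1 - r)^x r of a geometric law on {0, 1, 2, ...}."""
    return (1.0 - r) ** x * r

def total_variation_two_ways(p, q, upper=150):
    """Return (1/2) sum |p - q| and (1/2)(E_q[h] - E_p[h]), both truncated at `upper`."""
    support = range(upper + 1)
    h = lambda x: 1.0 if p(x) <= q(x) else -1.0   # h = 2 * 1[p <= q] - 1
    direct = 0.5 * sum(abs(p(x) - q(x)) for x in support)
    via_h = 0.5 * (sum(h(x) * q(x) for x in support)
                   - sum(h(x) * p(x) for x in support))
    return direct, via_h

direct, via_h = total_variation_two_ways(lambda x: poisson_pmf(x, 3.0),
                                         lambda x: geometric_pmf(x, 0.25))
print(direct, via_h)
```

The two values coincide up to rounding, because h(x)(q(x) - p(x)) = |p(x) - q(x)| pointwise.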

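In the first case, known results on the properties of $f_h^p$ show that the claim
on the error term is evident. The expression for the constant $\sqrt{\lambda}H(\lambda)$ is derived
from the quantity

\[ E_q\Big[\Big(\frac{\sqrt{\lambda}\,f_h^p(X+1)}{X+1}\Big)^2\Big] \]

with $h$ as specified above (and bounded by 2). Indeed, from (2.5) and [5, Theorem 2.3],
we get

\[ \Big|\frac{f_h^p(x+1)}{x+1}\Big| = \Big|\frac{1}{(x+1)p(x+1)} \sum_{k=0}^{x} \big(h(k) - E_p[h(X)]\big)p(k)\Big|
 \le \Big(1 \wedge \sqrt{\frac{2}{e\lambda}}\Big)\Big(\sup_{i \in \mathbb{N}} h(i) - \inf_{i \in \mathbb{N}} h(i)\Big). \]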



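The constant $H(\lambda)$ in the second case is derived from

\[ E_q\Big[\big(\lambda^{-1}e^{\lambda} f_h^p(X)\big)^2\Big]. \]

Actually, from (3.3) and [5, Theorem 2.3] we get

\[ \big|\lambda^{-1}e^{\lambda} f_h^p(x)\big| = \Big|\lambda^{-1}e^{\lambda}\,\frac{(x-1)!}{\lambda^{x-1}} \sum_{k=0}^{x-1} \big(h(k) - E_p[h(X)]\big)p(k)\Big|
 \le \Big(1 \wedge \sqrt{\frac{2}{e\lambda}}\Big)\Big(\sup_{i \in \mathbb{N}} h(i) - \inf_{i \in \mathbb{N}} h(i)\Big). \]

The claim follows. □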



For $\lambda \le 2/e$, $H(\lambda) = 1$ and hence the bounding constant for the scaled
Fisher information $\mathcal{K}_1(Po(\lambda),q)$ becomes $\sqrt{\lambda} \le \sqrt{2/e}$; in case $\lambda > 2/e$, this
constant equals $\sqrt{2/e}$. Since the error term $e_q^N$ is either null for $N = \infty$ or
negligible in comparison to the term involving the scaled Fisher information,
our bounds on the total variation distance corresponding to the first inequality
in (4.6) improve on those proposed in [8], where the bounding constant is given
by $\sqrt{2}$, while ours are at most $\sqrt{2/e}$. For the sake of illustration we conclude
this section by applying Proposition 4.1 to the three examples studied in [8].
Example 4.1. Take $X_i$ i.i.d. Bernoulli($\lambda/n$) random variables and let
$S_n = \sum_{i=1}^{n} X_i$. Put $q = P_{S_n}$, the density associated with the sum $S_n$. Then
straightforward calculations reveal that $\mathcal{K}_1(Po(\lambda),q) = \lambda^2/(n(n-\lambda))$ and $e_q^n$ is of order
$\lambda^n/n^{n+1}$. Consequently, we have

\[ \|P_{S_n} - Po(\lambda)\|_{TV} \le \sqrt{\frac{2}{e} \wedge \lambda}\;\frac{\lambda}{\sqrt{n(n-\lambda)}} + c\,\frac{\lambda^n}{n^{n+1}} \]

for some positive constant $c$ and sufficiently large $n$. This is an improvement
over the $(2 + \epsilon)\lambda/n$, $\epsilon > 0$, bound obtained in [8].
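
The closed form of the scaled Fisher information in Example 4.1 can be reproduced numerically. The Python sketch below is an illustration with assumed values n = 50 and lambda = 2; it also compares the exact total variation distance with the leading term of the first bound in (4.6), leaving the (here tiny) error term aside.

```python
import math

def poisson_pmf(x, lam):
    """Mass function of Po(lam) at the integer x >= 0."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def binomial_pmf(x, n, prob):
    """Mass function of Bin(n, prob); zero outside {0, ..., n}."""
    if x < 0 or x > n:
        return 0.0
    return math.comb(n, x) * prob ** x * (1.0 - prob) ** (n - x)

def scaled_fisher_information(q, lam, support):
    """lam * E_q[((X + 1) q(X + 1) / (lam q(X)) - 1)^2], summed over `support`."""
    return lam * sum(q(x) * ((x + 1) * q(x + 1) / (lam * q(x)) - 1.0) ** 2
                     for x in support)

n, lam = 50, 2.0                                   # illustrative values
q = lambda x: binomial_pmf(x, n, lam / n)

k1 = scaled_fisher_information(q, lam, range(n + 1))
closed_form = lam ** 2 / (n * (n - lam))           # value stated in Example 4.1
tv = 0.5 * sum(abs(q(x) - poisson_pmf(x, lam)) for x in range(150))
main_term = math.sqrt(min(2.0 / math.e, lam)) * math.sqrt(k1)
print(k1, closed_form, tv, main_term)
```

The computed information matches the closed form, and for these parameter values the total variation distance sits below the leading term of the bound.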

Example 4.2. Consider the same situation as above, but with $\lambda$ replaced by
$\mu\sqrt{n}$ for some $\mu > 0$. From the previous example, we directly deduce that

\[ \|P_{S_n} - Po(\mu\sqrt{n})\|_{TV} \le \sqrt{\frac{2}{e}}\;\frac{\mu}{\sqrt{n - \mu\sqrt{n}}} + c\,\frac{\mu^n}{n^{n/2+1}} \]

for some positive constant $c$ and sufficiently large $n$. Although the rate is good
and the constant above is again an improvement over the one obtained in [8], it
is still not as good as the optimal constant $\sqrt{1/(2\pi e)}$ derived in [3].

Example 4.3. Finally take $X_i$ independent geometric random variables with
respective distributions $P_i(x) = (1 - q_i)^x q_i\,\mathbb{I}_{\mathbb{N}}(x)$, where $0 < q_i < 1$ for all
$i = 1, \ldots, n$. Let $S_n = \sum_{i=1}^{n} X_i$ and $q = P_{S_n}$, the density associated with the
sum $S_n$. Put $\lambda = E[S_n]$. The subadditivity property of $\mathcal{K}_1(Po(\lambda),q)$ states that
(see [8, Proposition 3])

\[ \mathcal{K}_1(Po(\lambda),P_{S_n}) \le \sum_{i=1}^{n} \frac{e_i}{\lambda}\,\mathcal{K}_1(Po(e_i),P_{X_i}), \]

where $P_{X_i}$ is the density associated with $X_i$ and $e_i = E[X_i]$. Straightforward
computations show that $\mathcal{K}_1(Po(e_i),P_{X_i}) = (1 - q_i)^2/q_i$. Since here $e_q^{\infty} = 0$, it
follows that

\[ \|P_{S_n} - Po(\lambda)\|_{TV} \le \sqrt{\frac{2}{e}}\,\sqrt{\frac{1}{\lambda}\sum_{i=1}^{n} \frac{(1 - q_i)^3}{q_i^2}} \]

for sufficiently large $n$. Again we improve on the constant obtained in [8]. Note
that restricting, as in [8], to the case where $q_i = n/(n + \lambda)$ yields a rate of
$\sqrt{2/e}\,\big(\lambda/\sqrt{n(n+\lambda)}\big)$.

Next consider the second information functional $\mathcal{K}_2$. Direct computations
yield an expression for $\mathcal{K}_2(Po(\lambda), P_{S_n})$ which we dispense with here, and hence
an explicit bound on $\|P_{S_n} - Po(\lambda)\|_{TV}$ can also easily be obtained in terms of
this functional. The general expression appears inscrutable, and hence we
restricted our attention to the case where $q_i = n/(n + \lambda)$. There, numerical
evaluations in Mathematica 7 encourage us to suggest that the second information
distance provides a better rate than the $\sqrt{2/e}\,\big(\lambda/\sqrt{n(n+\lambda)}\big)$ mentioned above,
at least for moderate values of $\lambda$ and large values of $n$ (that is, $n \ge 100$).
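
The computations of Example 4.3 for the scaled Fisher information can be illustrated in the equal-parameter case $q_i = n/(n+\lambda)$. The Python sketch below assumes n = 50 and lambda = 2 and truncates all sums at 80; it checks the single-summand value $(1-q_i)^2/q_i$, the subadditivity inequality, and the resulting total variation bound. The sum of the geometric variables is taken to follow a negative binomial law, a standard fact used here as the working assumption.

```python
import math

def poisson_pmf(x, lam):
    """Mass function of Po(lam) at the integer x >= 0."""
    return math.exp(-lam + x * math.log(lam) - math.lgamma(x + 1))

def geometric_pmf(x, r):
    """Mass function (1 - r)^x r on {0, 1, 2, ...}."""
    return (1.0 - r) ** x * r

def negative_binomial_pmf(x, n, r):
    """Law of the sum of n i.i.d. geometric(r) variables on {0, 1, 2, ...}."""
    log_pmf = (math.lgamma(x + n) - math.lgamma(n) - math.lgamma(x + 1)
               + x * math.log(1.0 - r) + n * math.log(r))
    return math.exp(log_pmf)

def scaled_fisher_information(q, lam, upper):
    """lam * E_q[((X + 1) q(X + 1) / (lam q(X)) - 1)^2], truncated at `upper`."""
    return lam * sum(q(x) * ((x + 1) * q(x + 1) / (lam * q(x)) - 1.0) ** 2
                     for x in range(upper + 1))

n, lam, upper = 50, 2.0, 80                 # illustrative values
r = n / (n + lam)                           # common parameter q_i = n / (n + lam)
mean_single = (1.0 - r) / r                 # e_i = E[X_i] = lam / n

k1_single = scaled_fisher_information(lambda x: geometric_pmf(x, r), mean_single, upper)
k1_sum = scaled_fisher_information(lambda x: negative_binomial_pmf(x, n, r), lam, upper)
subadditive_bound = n * (mean_single / lam) * k1_single

tv = 0.5 * sum(abs(negative_binomial_pmf(x, n, r) - poisson_pmf(x, lam))
               for x in range(upper + 1))
tv_bound = math.sqrt(2.0 / math.e) * math.sqrt(subadditive_bound)
print(k1_single, k1_sum, subadditive_bound, tv, tv_bound)
```

In this equal-parameter case the subadditivity inequality is attained with equality, and the bound reduces to the rate quoted at the end of the first part of the example.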
5. Final comments

The results reported in the present work are to be read in conjunction with
those reported in [10]. The main message of these two papers is that all the
so-called Fisher information functionals used in the literature on Gaussian and
Poisson approximation bear an interpretation in terms of a specific Stein
characterization. As a concluding remark to the present paper we wish to stress the
fact that our method applies to many more distributions than just the Gaussian
or the Poisson (e.g., the compound Poisson, allowing comparisons with the
results of [1]), and in particular provides generalized scaled Fisher information
distances between any two (nice) distributions. Of course much remains to be
explored, in particular on the properties of these generalized information
functionals. However, the freedom of choice for the densities as well as for the test
functions in (2.6), (3.4) and [10, Theorem 2.3] makes us confident that there
remains much to be gained from a crafty usage of such identities.